Parallel texts (bitexts) have properties that distinguish them from other kinds of
parallel data. First, most words translate to only one other word. Second, bitext correspondence
is noisy. This article presents methods for biasing statistical translation models to reflect 
these properties. Analysis of the expected behavior of these biases in the presence of sparse 
data predicts that they will result in more accurate models. The prediction is confirmed 
by evaluation with respect to a gold standard — translation models that are biased in 
this fashion are significantly more accurate than a baseline knowledge-poor model. This 
article also shows how a statistical translation model can take advantage of various kinds 
of pre-existing knowledge that might be available about particular language pairs. Even
the simplest kinds of language-specific knowledge, such as the distinction between content
words and function words, are shown to reliably boost translation model performance
on some tasks. Statistical models that are informed by pre-existing knowledge about the
model domain combine the best of both the rationalist and empiricist traditions. 



1 Introduction 



The idea of a computer system for translating from one language to another is
almost as old as the idea of computer systems. The earliest written record of
this idea is a 1949 memorandum by Warren Weaver. More recently, Brown et al.
(1988) have proposed methods for constructing machine translation systems
automatically. Instead of codifying the human translation process from introspection,
Brown et al. appealed to machine learning techniques to induce models of the
process from examples of its input and output. The proposal generated much
excitement, because it held the promise of automating a task that fifty years
of research have proven extremely labor-intensive and error-prone. Yet, very few
other researchers have taken up the cause, partly because Brown et al.'s approach
was quite a departure from the paradigm in vogue at the time.

Formally, Brown et al. built statistical models of translational equivalence (or
translation models^1, for short). Translational equivalence is a relation that
holds between two expressions with the same meaning, where the two expressions
are in different languages. As with all statistical models, the best translation
models are those whose parameters correspond best with the sources of variance in the
data. Translation models whose parameters reflect existing knowledge about
particular languages and language pairs and/or universal properties of translational
equivalence benefit from the best of both the empiricist and rationalist
traditions. This article presents three such models, along with methods for efficiently
estimating their parameters.

1 Note that the term translation model, which is standard in the literature, refers to a static
mathematical relationship between two data sets. In this usage, the term says nothing about the
process of translation, automated or otherwise.

More specifically, in this article, I introduce methods for modeling three uni- 
versal properties of translational equivalence in parallel texts (bitexts): 

1. Most word tokens translate to only one word token. I capture this 
tendency in a one-to-one assumption. 

2. Most text segments are not translated word-for-word. I build an explicit 
noise model. 

3. Different linguistic objects have statistically different behavior in 
translation. I show a way to condition translation models on different 
word classes to help account for the variety. 

Quantitative evaluation with respect to a gold standard has shown that each of 
the three methods effects a significant improvement in translation model accuracy. 

A review of previously published translation models follows an introduction 
to the different kinds of possible translation models. The core of the article is a 
presentation of the model estimation biases described above and an analysis of 
their expected behavior in the face of sparse data. The last section reports the 
results of a variety of experiments designed to evaluate these innovations. 

Throughout this article, I shall use CALLIGRAPHIC letters to denote
entire text corpora and other sets of sets, CAPITAL letters to denote collections,
including strings and bags, and italics for scalar variables. I shall also distinguish 
between types and tokens by using bold font for the former and plain font for 
the latter. 



2 Translation Model Decomposition 



There are two kinds of applications of translation models: those where word order 
plays a crucial role and those where it doesn't. Empirically estimated models of 
translational equivalence among word types can play a central role in both kinds 
of applications. 

Applications where word order is not important (or at least not essential)
include

- cross-language information retrieval (e.g. Oard & Dorr, 1996),
- computer-assisted language learning (Nerbonne et al., 1997),
- certain machine-assisted translation tools (e.g. Macklovitch, 1994;
  Melamed, 1996a),
- concordancing for bilingual lexicography (Catizone et al., 1993;
  Gale & Church, 1991),
- corpus linguistics (e.g. Svartvik, 1992),
- "crummy" machine translation on the internet (Church & Hovy, 1993).



For these applications, empirical models have a number of advantages over hand- 
crafted models such as on-line versions of printed bilingual dictionaries. Two of 
the advantages are the possibility of better coverage and the possibility of frequent 
updates by non-expert users to keep up with rapidly evolving vocabularies. 

A third advantage is that empirical models can provide more accurate
information about the relative importance of different translations. Such information
is crucial for applications such as cross-language information retrieval (CLIR).
In CLIR, the query vector Q' is in a different language (a different vector space)
from the document vectors D. Matrix multiplication by a word-to-word
translation model T can map Q' into a vector Q in the vector space of D: Q = Q'T.
In order for the mapping to be accurate, T must be able to encode many levels
of relative importance among the possible translations of each element of Q'. A
typical machine-readable bilingual dictionary says only what the possible
translations are, which is equivalent to positing a uniform translational distribution. The
performance of cross-language information retrieval with a uniform T is likely to
be limited in the same way as the performance of conventional information
retrieval without term frequency information, i.e. where the system knows which
terms occur in which documents, but not how often (Buckley, 1993).
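
To make the role of graded translation probabilities concrete, the following minimal Python sketch maps a query vector into the document language as Q = Q'T. The sparse dictionary representation, the function name map_query, and the toy English-French probabilities are illustrative assumptions, not data from any experiment.

```python
def map_query(query, trans_model):
    """Map a source-language query vector Q' into the target space: Q = Q'T.

    query:       dict mapping source word -> weight (the vector Q')
    trans_model: dict mapping source word -> {target word: probability} (rows of T)
    """
    mapped = {}
    for src_word, weight in query.items():
        for tgt_word, prob in trans_model.get(src_word, {}).items():
            mapped[tgt_word] = mapped.get(tgt_word, 0.0) + weight * prob
    return mapped


# Hypothetical toy data: one query term with two possible translations.
query = {"bank": 1.0}
graded_T = {"bank": {"banque": 0.8, "rive": 0.2}}   # empirical, graded model
uniform_T = {"bank": {"banque": 0.5, "rive": 0.5}}  # dictionary-style, uniform

graded_Q = map_query(query, graded_T)
uniform_Q = map_query(query, uniform_T)
```

With the graded model, the mapped query weights "banque" four times as heavily as "rive"; with the uniform model, the two translations are indistinguishable, which is the limitation described above.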

Fully automatic high-quality machine translation is the prototypical applica- 
tion where word order is crucial. In such an application, a word-to-word transla- 
tion model can serve as an independent module in a more complex string-to-string 
translation model. The independence of such a module is desirable for two rea- 
sons, one practical and one philosophical. The practical reason is illustrated in 
this article: Order-independent translation models can be accurately estimated 
more efficiently in isolation. The philosophical reason is that words are an impor- 
tant epistemological category in our naive mental representations of language. We 
have many intuitions (and even some testable theories) about what words are and 
how they behave. We can bring these intuitions to bear on our translation models 
without being distracted by other facets of language, such as phrase structure. 
For example, Chapter 9 of my dissertation is based on the intuition that words
can have multiple senses (Melamed, 1998b); Brown et al. (1993)'s Model 3 and
my work on non-compositional compounds (Melamed, 1997b) are based on the
intuition that spaces in text do not necessarily delimit words.

The independence of a word-to-word translation module in a string-to-string 
translation model can be effected by a two-stage decomposition. The first stage 
is based on the observation that every string S is just an ordered bag, and that 
the bag B can be modeled independently of its order O. For example, the string
⟨abc⟩ consists of the bag {c, a, b} and the ordering relation {(b, 2), (a, 1), (c, 3)}.
If we represent each string S as a pair (B, O), then

    \Pr(S) = \Pr(B, O)                      (1)
           = \Pr(B) \cdot \Pr(O | B).       (2)
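
The bag-plus-order view of a string can be illustrated with a few lines of Python. This is only a sketch: the function names decompose and recompose are hypothetical, and positions are numbered from 1 to match the example in the text.

```python
from collections import Counter


def decompose(string):
    """Split a string into its bag B and its ordering relation O."""
    bag = Counter(string)
    order = {(symbol, position) for position, symbol in enumerate(string, start=1)}
    return bag, order


def recompose(order):
    """Rebuild the string from the ordering relation alone."""
    return "".join(symbol for symbol, _ in sorted(order, key=lambda pair: pair[1]))


bag, order = decompose("abc")
```

Here order is the set {("b", 2), ("a", 1), ("c", 3)}, and the bag carries no positional information at all, which is why the two components can be modeled separately.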

Now, let S_1 and S_2 be two strings and let A be a one-to-one mapping between
the elements of S_1 and the elements of S_2. Borrowing a term from the operations
research literature, I shall refer to such mappings as assignments^2. Let \mathcal{A} be
the set of all possible assignments between S_1 and S_2. Using assignments, we
can decompose conditional and joint probabilities over strings:

    \Pr(S_1 | S_2) = \sum_{A \in \mathcal{A}} \Pr(S_1, A | S_2)                  (3)

    \Pr(S_1, S_2)  = \sum_{A \in \mathcal{A}} \Pr(S_1, A, S_2)                   (4)

where

    \Pr(S_1, A | S_2) = \Pr(B_1, O_1, A | S_2)                                   (5)
                      = \Pr(B_1, A | S_2) \cdot \Pr(O_1 | B_1, A, S_2)           (6)

    \Pr(S_1, A, S_2)  = \Pr(B_1, O_1, A, B_2, O_2)                               (7)
                      = \Pr(B_1, A, B_2) \cdot \Pr(O_1, O_2 | B_1, A, B_2)       (8)

The second stage of decomposition takes us from bags of words to the words
that they contain. The following bag-pair generation process illustrates how a
word-to-word translation model can be embedded in a bag-to-bag translation
model for languages L1 and L2:

1. Generate a bag size b with probability Z(b) (mnemonic: Z is the siZe
distribution). b is also the assignment size.

2. Generate b language-independent concepts C_1, ..., C_b.

3. From each concept C_i, 1 ≤ i ≤ b, generate a pair of word strings (\vec{u}_i, \vec{v}_i)
from L1* × L2*, according to the distribution trans(\vec{u}, \vec{v}), to lexicalize
the concept in the two languages. Some concepts are not lexicalized in
some languages, so one of \vec{u}_i and \vec{v}_i may be empty.

A pair of bags containing m and n non-empty word strings can be generated by
a process where b is anywhere between 1 and m + n.

Without loss of generality, we can assume that each different pair of word
string types (\vec{u}, \vec{v}) is deterministically generated from a different concept. Thus,
a bag-to-bag translation model can be fully specified by the distributions Z and
trans. For notational convenience, the elements of the two bags can be labeled
so that B_1 = {\vec{u}_1, ..., \vec{u}_b} and B_2 = {\vec{v}_1, ..., \vec{v}_b}, where some of the \vec{u}'s and
\vec{v}'s may be empty. The elements of an assignment, then, are pairs of bag element
labels: A = {(i_1, j_1), ..., (i_b, j_b)}, where each i ranges over {\vec{u}_1, ..., \vec{u}_b}, each
j ranges over {\vec{v}_1, ..., \vec{v}_b}, each i is distinct and each j is distinct. The label
pairs in a given assignment can be generated in any order, so there are b! ways
to generate an assignment of size b.^3 It follows that the probability of generating
a pair of bags (B_1, B_2) with a particular assignment A of size b is

    \Pr(B_1, A, B_2 | Z, trans) = Z(b) \cdot b! \prod_{(i,j) \in A} trans(\vec{u}_i, \vec{v}_j)      (9)

The joint probability distribution trans(\vec{u}, \vec{v}) is a word-to-word translation
model.

2 Assignments are different from Brown et al. (1993)'s alignments in that assignments can range over
pairs of arbitrary labels, not necessarily string position indexes. Also, unlike alignments,
assignments must be one-to-one.
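
As a worked instance of Equation 9, the sketch below computes the probability of a bag pair under a given assignment. It assumes that the assignment is supplied directly as a list of word pairs, that Z and trans are stored in plain dictionaries, and that the toy probabilities are invented for illustration; none of these choices come from the article.

```python
from math import factorial, prod


def bag_pair_probability(assignment, size_dist, trans):
    """Pr(B1, A, B2 | Z, trans) = Z(b) * b! * product of trans over linked pairs.

    assignment: list of (u, v) word pairs, one per generated concept
    size_dist:  dict mapping bag size b -> Z(b)
    trans:      dict mapping (u, v) -> joint translation probability
    """
    b = len(assignment)
    return size_dist[b] * factorial(b) * prod(trans[pair] for pair in assignment)


# Hypothetical toy model with two concepts.
Z = {2: 0.5}
trans = {("a", "x"): 0.25, ("b", "y"): 0.5}
probability = bag_pair_probability([("a", "x"), ("b", "y")], Z, trans)
```

For this toy model the result is 0.5 * 2! * (0.25 * 0.5) = 0.125.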



3 The One-to-One Assumption 



The most general word-to-word translation model trans(\vec{u}, \vec{v}), where \vec{u} and \vec{v}
range over the strings of L1 and L2, has an infinite number of parameters. This
model can be constrained in various ways to make it more practical. The models
presented in this article are based on the one-to-one assumption: Each word
is translated to at most one other word. In these models, \vec{u} and \vec{v} may consist
of at most one word each. As before, one of the two strings (but not both) may
be empty. I shall describe empty strings as consisting of a special null word,
so that each word string will contain exactly one word and can be treated as a
scalar. Henceforth, I shall write u and v instead of \vec{u} and \vec{v}. Under the one-to-one
assumption, a pair of bags containing m and n non-empty words can be generated
by a process where the bag size b is anywhere between max(m, n) and m + n.

The one-to-one assumption is not as restrictive as it may appear: The
explanatory power of a model based on this assumption may be raised to an arbitrary
level by redefining what words are. For example, I have shown elsewhere how to
efficiently estimate word-to-word translation models where a word can be a
non-compositional compound consisting of several space-delimited tokens (Melamed,
1997b). For the purposes of this article, however, words are the tokens
generated by my tokenizers and stemmers for the languages in question. Therefore, the
models in this article are only a first approximation to the vast complexities of
translational equivalence. They are intended mainly as stepping stones towards
better models.



4 Previous Work 



Most methods for estimating translation models from bitexts start with the
following intuition: Words that are translations of each other are more likely to
appear in corresponding bitext regions than other pairs of words. Following this
intuition, most authors begin by counting the number of times that word types
in one half of the bitext co-occur with word types in the other half. Different
co-occurrence counting methods stem from different models of co-occurrence.

3 The number of permutations is smaller when either bag contains two or more identical elements,
but this detail will not affect the estimation algorithms presented here.

A model of co-occurrence is a boolean predicate, which indicates whether
a given pair of word tokens co-occur in corresponding regions of the bitext space.
Different models of co-occurrence are possible, depending on the kind of bitext
map that is available, the language-specific information that is available, and the
assumptions made about the nature of translational equivalence. All the
translation models reviewed and introduced in this article can be based on any of the
co-occurrence models described by Melamed (1998a). For expository purposes,
however, I shall assume a boundary-based model of co-occurrence throughout
this article. A boundary-based model of co-occurrence assumes that both halves
of the bitext have been segmented into s segments, so that segment U_i in one half
of the bitext and segment V_i in the other half are mutual translations, 1 ≤ i ≤ s.
Under this model of co-occurrence, the co-occurrence count cooc(u, v) for word
types u and v is the number of times that u ∈ U_i and v ∈ V_i in some aligned
segment pair i.
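
A boundary-based co-occurrence count can be sketched as follows. The sketch assumes that the bitext is given as a list of aligned, tokenized segment pairs and that every token pair within a segment pair counts once, so that a segment pair contributes e(u) * f(v) to cooc(u, v); other counting conventions are possible. The function name and the toy bitext are hypothetical.

```python
from collections import Counter


def cooc_counts(segment_pairs):
    """Count co-occurrences of word types across aligned segment pairs."""
    counts = Counter()
    for seg_u, seg_v in segment_pairs:
        freq_u, freq_v = Counter(seg_u), Counter(seg_v)
        for u, e_u in freq_u.items():
            for v, f_v in freq_v.items():
                # Every token of u co-occurs with every token of v in this pair.
                counts[u, v] += e_u * f_v
    return counts


# Hypothetical two-segment bitext.
bitext = [
    (["the", "house"], ["la", "maison"]),
    (["the", "car"], ["la", "voiture"]),
]
cooc = cooc_counts(bitext)
```

In this toy bitext, "the" and "la" co-occur twice, while "house" and "voiture" never co-occur.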

4.1 Non-Probabilistic Translation Lexicons 

Many researchers have proposed greedy algorithms for estimating non-probabilistic
word-to-word translation models, also known as translation lexicons (e.g. Catizone
et al., 1993; Gale & Church, 1991; Fung, 1995; Kumano & Hirakawa, 1994;
Melamed, 1995; Wu & Xia, 1995). Most of these algorithms can be summarized
as follows:

1. Choose a similarity function S between word types in L1 and word
types in L2.

2. Compute association scores S(u, v) for a set of word type pairs
(u, v) ∈ (L1 × L2) that occur in training data.

3. Sort the word pairs in descending order of their association scores.

4. Discard all word pairs for which S(u, v) is less than a chosen threshold t.
The remaining word pairs become the entries in the translation lexicon.
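
The four steps above can be sketched in a few lines of Python, assuming that the association scores of Step 2 have already been computed and stored in a dictionary. The function name, the scores, and the threshold are hypothetical.

```python
def greedy_lexicon(scores, threshold):
    """Sort word pairs by association score and keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, score in ranked if score >= threshold]


# Hypothetical association scores S(u, v).
scores = {
    ("house", "maison"): 9.1,
    ("the", "la"): 12.4,
    ("house", "la"): 1.3,
}
lexicon = greedy_lexicon(scores, threshold=5.0)
```

Each pair is accepted or rejected on the basis of its own score alone, which is the independence assumption examined in the rest of this section.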

The various proposals differ mainly in their choice of similarity function. Almost
all the similarity functions in the literature are based on a model of co-occurrence
with some linguistically-motivated filtering (see Fung, 1995, for a notable
exception).

Given a reasonable similarity function, the greedy algorithm works
remarkably well, considering how simple it is. However, the association scores in Step 2
are typically computed independently of each other. The problem with this
independence assumption is illustrated in Figure 1. The two strings represent
corresponding regions of a bitext. If q and v co-occur much more often than expected
by chance, then any reasonable similarity metric will deem them likely to be
mutual translations. If q and v are indeed mutual translations, then their tendency
to co-occur is called a direct association. Now, suppose that q and r often
co-occur within their language. Then v and r will also co-occur more often than
expected by chance. The arrow between v and r in Figure 1 represents an
indirect association, since the association between v and r arises only by virtue of
the association between each of them and q. Models of translational equivalence
that are ignorant of indirect associations have "a tendency ... to be confused by
collocates" (Dagan et al., 1993).

Figure 1
q and v often co-occur, as do q and r. The direct association between q and v, and the direct
association between q and r give rise to an indirect association between v and r.



Paradoxically, the irregularities (noise) in text and in translation mitigate the 
problem. If noise in the data reduces the strength of a direct association, then the 
same noise will reduce the strengths of any indirect associations that are based 
on this direct association. On the other hand, noise can reduce the strength of an 
indirect association without affecting any direct associations. Therefore, direct 
associations are usually stronger than indirect associations. If all the entries in a 
translation lexicon are sorted by their association scores, the direct associations 
will be very dense near the top of the list, and sparser towards the bottom. 

Gale & Church (1991) have shown that entries at the very top of the list can 
be over 98% correct. Their algorithm gleaned lexicon entries for about 61% of 
the word tokens in a sample of 800 English sentences. To obtain 98% precision, 
their algorithm selected only entries for which it had high confidence that the 
association score was high. These would be the word pairs that co-occur most 
frequently. A random sample of 800 sentences from the same corpus showed that 
61% of the word tokens, where the tokens are of the most frequent types, represent 
4.5% of all the word types. A similar strategy was employed by Wu & Xia (1995)
and by Fung (1995).^4 Fung skimmed off the top 23.8% of the noun-noun entries
in her lexicon to achieve a precision of 71.6%. Wu & Xia have reported automatic
acquisition of 6517 lexicon entries from a 3.3-million-word corpus, with a precision
of 86%. The first 3.3 million word tokens in an English corpus from a similar
genre contained 33490 different word types, suggesting a recall of roughly 19%.
Note, however, that Wu & Xia chose to weight their precision estimates by the
probabilities attached to each entry:

    For example, if the translation set for English word detect has
    the two correct Chinese candidates with 0.533 probability and
    with 0.277 probability, and the incorrect translation with 0.190
    probability, then we count this as 0.810 correct translations and
    0.190 incorrect translations. (Wu & Xia, 1995, p. 211)

This is a reasonable evaluation method, but it is not comparable to methods
that simply count each lexicon entry as either right or wrong (e.g. Daille et al.,
1994; Melamed, 1996b). A weighted precision estimate pays more attention to
entries that are more frequent and hence easier to estimate. Therefore, weighted
precision estimates are generally higher than unweighted ones.

4 These two results should not be judged on the same scale, because it is arguably more difficult to
construct translation lexicons between English and Chinese than between English and French.
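
The difference between the two evaluation schemes can be checked on the example quoted from Wu & Xia. The sketch below uses only the three probabilities given in the quotation; the function names and the (probability, is_correct) representation are assumptions made for illustration.

```python
def weighted_precision(entries):
    """Weight each lexicon entry by its probability, in the style of Wu & Xia."""
    total = sum(prob for prob, _ in entries)
    return sum(prob for prob, correct in entries if correct) / total


def unweighted_precision(entries):
    """Count each lexicon entry as simply right or wrong."""
    return sum(1 for _, correct in entries if correct) / len(entries)


# Translation set for "detect" from the quotation: (probability, is_correct).
detect = [(0.533, True), (0.277, True), (0.190, False)]
weighted = weighted_precision(detect)
unweighted = unweighted_precision(detect)
```

For this single translation set, the weighted estimate is 0.810 while the unweighted estimate is 2/3, which illustrates why the two kinds of figures should not be compared directly.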

4.2 Re-estimated String-to-String Translation Models 

Most translation model re-estimation algorithms published to date are variations
on the theme proposed by Brown et al. (1993). These models involve conditional
probabilities, but they can be compared to symmetric models based on joint
probabilities if the latter are normalized by the appropriate marginal distribution.
I shall review these models using the notation in Table 1.

    (\mathcal{U}, \mathcal{V}) = the two halves of the bitext
    (U, V)       = a pair of aligned text segments in (\mathcal{U}, \mathcal{V})
    e(u)         = the frequency of u in U
    f(v)         = the frequency of v in V
    cooc(u, v)   = the number of times that u and v co-occur
    links(u, v)  = the number of times that u and v are hypothesized to
                   co-occur as mutual translations
    trans(v|u)   = the probability that a token of u will be translated as
                   a token of v

Table 1
Variables used to describe translation models.

Methods for estimating translation parameters from co-occurrence counts
invariably involve link counts links(u, v), which represent hypotheses about the
number of times that u and v were generated together from the same
language-independent concept, for each u and v in the bitext. A link token is an ordered
pair of word tokens, one from each half of the bitext. A link type is an ordered
pair of word types. The link counts links(u, v) range over link types.

4.2.1 Models Using Only Co-occurrence Information Brown et al.'s Model 1
is estimated from co-occurrence information only, using the Expectation-Maximization
(EM) algorithm (Dempster et al., 1977).

E step:

    links(u, v) = \sum_{(U,V) \in (\mathcal{U},\mathcal{V})} \frac{trans(v|u)}{\sum_{u' \in U} trans(v|u')} \cdot e(u) \cdot f(v)      (10)

M step:

    trans(v|u) = \frac{links(u, v)}{\sum_{v'} links(u, v')}                                   (11)

It is instructive to consider the form of Equation 10 when all the translation
probabilities trans(v|u) for a particular u are initialized to the same constant p,
as Brown et al. (1993, p. 273) actually do:

    links(u, v) = \sum_{(U,V) \in (\mathcal{U},\mathcal{V})} \frac{p}{p \cdot |U|} \cdot e(u) \cdot f(v)      (12)

                = \sum_{(U,V) \in (\mathcal{U},\mathcal{V})} \frac{e(u) \cdot f(v)}{|U|}                  (13)

The initial link count for each (u, v) pair is set proportional to the co-occurrence
count of u and v and inversely proportional to the length of each segment U in
which u occurs. The intuition behind the numerator is central to most bitext-based
translation models: The more often two words co-occur, the more likely
they are to be mutual translations. The intuition behind the denominator is that
the co-occurrence count of u and v should be discounted to the degree that v
also co-occurs with other words in the same segment pair.

Now consider how Equation 12 would behave under a distance-based model
of co-occurrence (Melamed, 1998a), where each token of v co-occurs with exactly
c words (where c is constant):

    links(u, v) = \sum_{(U,V) \in (\mathcal{U},\mathcal{V})} \frac{e(u) \cdot f(v)}{c}                    (14)

                = \frac{1}{c} \sum_{(U,V) \in (\mathcal{U},\mathcal{V})} e(u) \cdot f(v)                  (15)

The discount factor 1/c disappears in the M step. The only difference between
Equations 13 and 15 is that the former discounts co-occurrences proportionally
to the segment lengths. When information about segment lengths is not available,
Model 1's initial parameters boil down to co-occurrence counts.
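
The E and M steps of Model 1 can be sketched as follows. This is a simplified illustration rather than Brown et al.'s implementation: it omits the null word, initializes every trans(v|u) to the same constant, and runs for a fixed number of iterations. The function name and the toy bitext are hypothetical.

```python
from collections import defaultdict


def model1_em(bitext, iterations=5):
    """Estimate trans(v|u) from aligned segment pairs with a Model 1 style EM loop."""
    src_vocab = {u for seg_u, _ in bitext for u in seg_u}
    tgt_vocab = {v for _, seg_v in bitext for v in seg_v}
    # Uniform initialization: every trans(v|u) starts at the same constant p.
    trans = {u: {v: 1.0 / len(tgt_vocab) for v in tgt_vocab} for u in src_vocab}
    for _ in range(iterations):
        links = defaultdict(lambda: defaultdict(float))
        # E step: looping over tokens accounts for the e(u) * f(v) factor.
        for seg_u, seg_v in bitext:
            for v in seg_v:
                norm = sum(trans[u][v] for u in seg_u)
                for u in seg_u:
                    links[u][v] += trans[u][v] / norm
        # M step: renormalize the link counts for each source word type.
        for u, row in links.items():
            total = sum(row.values())
            trans[u] = {v: count / total for v, count in row.items()}
    return trans


# Hypothetical three-segment bitext.
bitext = [
    (["the", "house"], ["la", "maison"]),
    (["the", "car"], ["la", "voiture"]),
    (["house"], ["maison"]),
]
one_step = model1_em(bitext, iterations=1)
converging = model1_em(bitext, iterations=5)
```

After the first iteration, the link counts for "house" are 1.5 with "maison" and 0.5 with "la", so trans(maison|house) is 0.75: the initial estimates are co-occurrence counts discounted by segment length, as described above.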

4.2.2 Word Order Correlation Biases In any bitext, the positions of words
with respect to the true bitext map correlate with the positions of their
translations. The correlation is stronger for language pairs with more similar word
order. Brown et al. (1988) introduced the idea that this correlation can be
encoded in translation model parameters. Dagan et al. (1993) expanded on this
idea by replacing Brown et al.'s word alignment parameters, which were based
on absolute word positions in aligned segments, with a much smaller set of relative
offset parameters. The much smaller number of parameters allowed Dagan et al.'s
model to be effectively trained on much smaller bitexts. Vogel et al. (1996) have
shown how some additional independence assumptions can turn this model into
a Hidden Markov Model, enabling even more efficient parameter estimation.

It cannot be overemphasized that the word order correlation bias is just
knowledge about the problem domain, which can be used to guide the search
for the optimum model parameters. Translational equivalence can be empirically
modeled for any pair of languages, but some models and model biases work better
for some language pairs than for others. The word order correlation bias is most
useful when it has high predictive power, i.e. when the distribution of alignments
or offsets has low entropy. The entropy of this distribution is indeed relatively low
for the language pair that both Brown et al. and Dagan et al. were working with:
French and English have very similar word order. A word order correlation
bias would be of less benefit with noisier training bitexts or for language pairs
with less similar word order. The same is true of the phrase structure biases in
Brown et al. (1993)'s Models 4 and 5.



4.3 Re-estimated Bag-to-Bag Translation Models 

At about the same time that I developed the models in this article, Hiemstra
(1996) independently developed his own bag-to-bag model of translational
equivalence. His model is also based on a one-to-one assumption, but it differs from
my models in that it does not allow empty words to be generated. I.e., it assumes
that the two bags in each pair contain the same number of words. His estimation
method is also different in that it does not impose any structure on the hidden
parameters: Whereas my estimation methods revolve around assignments (see
Equation ), Hiemstra has modeled pairs of word bags as a multinomial over the
crossproduct of two vocabularies. Maximum likelihood parameter estimation is
computationally too expensive for Hiemstra's model, so he proposed the Iterative
Proportional Fitting Procedure (IPFP) (Deming & Stephan, 1940) as a cheaper
approximation method.



The IPFP is quite sensitive to initial conditions, so Hiemstra investigated
a number of initialization options. Choosing the most advantageous, Hiemstra
has published parts of the translational distributions of certain words, induced
using both his method and Brown et al. (1993)'s Model 1 from the same training
bitext. Subjective comparison of these examples suggests that Hiemstra's method
is more accurate. Hiemstra (1998) has also evaluated the recall and precision of
his method and of Model 1 on a small hand-constructed set of link tokens in a
particular bitext. Model 1 fared worse, on average.



5 Parameter Estimation 



This section describes my methods for estimating the translation parameters of a
symmetric word-to-word translation model from a bitext. For most applications,
we are interested in estimating the probability trans(u, v) of generating the pair
of words (u, v). For estimation purposes, however, it is more convenient to deal
with likelihoods like(u, v), the likelihood that u and v can ever be mutual
translations, i.e. that there exists some context where tokens u and v are generated
from the same concept. There are various possible definitions for like(u, v) and
the relationship between like(u, v) and trans(u, v) can be more or less direct,
depending on the model. The maximum likelihood estimate of trans(u, v) can
always be derived by normalizing the link counts so that \sum_{u,v} trans(u, v) = 1:

    trans(u, v) = \frac{links(u, v)}{\sum_{u',v'} links(u', v')}                  (16)
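
The normalization that turns link counts into a joint distribution is small enough to state directly in code. The dictionary representation, the function name, and the toy link counts below are assumptions made for illustration.

```python
def normalize_links(links):
    """Convert link counts into a joint distribution trans(u, v) that sums to 1."""
    total = sum(links.values())
    return {pair: count / total for pair, count in links.items()}


# Hypothetical link counts.
links = {("the", "la"): 3, ("house", "maison"): 1}
trans = normalize_links(links)
```

With these counts, trans("the", "la") is 0.75 and trans("house", "maison") is 0.25.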

Link counts, and therefore also the translation parameters trans(u, v),
cannot be directly observed in a training bitext, because we don't know which words
in one half of the bitext were generated together with which words in the other
half. The observable features of the bitext are only the co-occurrence counts
cooc(u, v) (see Section 4). All my methods for estimating the translation
parameters trans(u, v) from the co-occurrence counts cooc(u, v) share the following
general outline:

1. Initialize the model parameters to a first approximation.

2. Estimate the link counts links(u, v), as a function of the model
parameters and the co-occurrence counts.

3. Estimate the model parameters like(u, v), as a function of the link
counts and the co-occurrence counts.

4. Repeat from Step 2, until the model converges to the desired degree. I
have adopted the simple heuristic that the model has converged when
less than .0001 of the trans(u, v) distribution changes from one
iteration to the next.

5. Compute the maximum likelihood estimate (MLE) of trans(u, v), by
normalizing the converged link counts as in Equation 16.



Under certain conditions, a parameter estimation process of this sort is an
instance of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977).
As explained below, meeting these conditions is computationally too expensive
for my models. Therefore, I employ some approximations, which lack the EM
algorithm's convergence guarantee.

The maximum likelihood approach to estimating the unknown parameters is
to find the set of parameters \hat{\Theta} that maximize the probability of the training
bitext (\mathcal{U}, \mathcal{V}):

    \hat{\Theta} = \arg\max_{\Theta} \Pr(\mathcal{U}, \mathcal{V} | \Theta)                              (17)

where the probability of the bitext is a weighted sum over the distribution \mathcal{A} of
possible assignments:

    \Pr(\mathcal{U}, \mathcal{V} | \Theta) = \sum_{A \in \mathcal{A}} \Pr(\mathcal{U}, A, \mathcal{V} | \Theta).             (18)

The MLE method is infeasible, because the number of possible assignments grows
exponentially with the size of the bitext. Due to the parameter interdependencies
introduced by the one-to-one assumption, we cannot decompose the
assignments into parameters that can be estimated independently of each other (as in
Brown et al., 1993, Equation 26). This is why we must make do with approximations
to the EM algorithm.



In this situation, Brown et al. (1993, p. 293) recommend "evaluating the
expectations using only a single, probable alignment." The single most probable
assignment A_max is called the Viterbi assignment:

    A_max = \arg\max_{A \in \mathcal{A}} \Pr(\mathcal{U}, A, \mathcal{V} | \Theta)                                        (19)

          = \arg\max_{A \in \mathcal{A}} Z(b) \cdot b! \prod_{(x,y) \in A} trans(u_x, v_y)                    (20)

          = \arg\max_{A \in \mathcal{A}} \log \left[ Z(b) \cdot b! \prod_{(x,y) \in A} trans(u_x, v_y) \right]    (21)

          = \arg\max_{A \in \mathcal{A}} \left\{ \log[Z(b) \cdot b!] + \sum_{(x,y) \in A} \log trans(u_x, v_y) \right\}   (22)

To simplify things further, let us assume that Z(b) \cdot b! is constant, so that

    A_max = \arg\max_{A \in \mathcal{A}} \sum_{(x,y) \in A} \log trans(u_x, v_y)                         (23)

If we represent the bitext as a bipartite graph and weight the edges by log trans(u, v),
then the right-hand side of Equation 23 is an instance of the weighted maximum
matching problem and A_max is its solution. For a bipartite graph G = (V_1 ∪ V_2, E),
with v = |V_1 ∪ V_2| and e = |E|, the lowest currently known upper bound on the
computational complexity of this problem is O(ve + v^2 log v) (Ahuja et al., 1993,
p. 500). Although this upper bound is polynomial, it is still too expensive for
typical bitexts. The next subsection describes a greedy approximation to the Viterbi
approximation.
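
For very small bags, the Viterbi assignment can be found by exhaustive search, which makes the objective of the weighted matching problem easy to inspect. The sketch below is not the polynomial-time matching algorithm cited above: it enumerates all permutations, assumes two bags of equal size (as after padding with null words), and assumes strictly positive probabilities. The function name and the toy model are hypothetical.

```python
from itertools import permutations
from math import log


def viterbi_assignment(words_u, words_v, trans):
    """Return the one-to-one assignment that maximizes the sum of log trans(u, v)."""
    best_score, best_pairs = float("-inf"), None
    for permuted_v in permutations(words_v):
        pairs = list(zip(words_u, permuted_v))
        score = sum(log(trans[pair]) for pair in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_pairs


# Hypothetical joint translation probabilities.
trans = {
    ("the", "la"): 0.4,
    ("the", "maison"): 0.1,
    ("house", "la"): 0.2,
    ("house", "maison"): 0.3,
}
best = viterbi_assignment(["the", "house"], ["la", "maison"], trans)
```

Here the assignment {(the, la), (house, maison)} scores log 0.12, which beats the alternative's log 0.02. The number of permutations grows factorially with the bag size, so exhaustive search is usable only as an illustration.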



5.1 Method A: The Competitive Linking Algorithm 
5.1.1 Step 1: Initialization Almost every published translation model
estimation algorithm exploits the well-known correlation between the link likelihoods
like(u, v) and the co-occurrence counts cooc(u, v). As discussed in Section 4,
many algorithms also normalize the correlation by the marginal frequencies of
u and v. However, these quantities account for only three of the cells in the
following contingency table:

             u               ¬u               Total
    v        cooc(u, v)      cooc(¬u, v)      cooc(·, v)
    ¬v       cooc(u, ¬v)     cooc(¬u, ¬v)     cooc(·, ¬v)
    Total    cooc(u, ·)      cooc(¬u, ·)      cooc(·, ·)



The statistical interdependence between two word types can be estimated more
robustly by considering the whole table. For example, Gale & Church (1991)
suggest that "$\phi^2$, a $\chi^2$-like statistic, seems to be a particularly good
choice because it makes good use of the off-diagonal cells" of the contingency
table. In informal experiments reported elsewhere (Melamed, 1995), I found that
the $G^2$ statistic suggested by Dunning (1993) slightly outperforms $\phi^2$. Let
the cells of the contingency table be named as follows:





        u       ¬u
v       a       b
¬v      c       d



Now,

\begin{equation}
G^2(u, v) = -2 \log \frac{B(a \mid a+b, p)\, B(c \mid c+d, p)}{B(a \mid a+b, p_1)\, B(c \mid c+d, p_2)} \tag{24}
\end{equation}

where $B(k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}$ are binomial probabilities. The
statistic uses maximum likelihood estimates for the probability parameters:
$p_1 = \frac{a}{a+b}$, $p_2 = \frac{c}{c+d}$, $p = \frac{a+c}{a+b+c+d}$. $G^2$ is easy
to compute because the binomial coefficients cancel out. All my methods
initialize the model parameters like(u, v) to $G^2(u, v)$, except that the
likelihood of any word being linked to NULL is initialized to an infinitesimal
value. I have also found it useful to smooth the co-occurrence counts using the
Simple Good-Turing smoothing method (Gale & Sampson, 1995) before computing
$G^2$.
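The initialization step can be illustrated with a minimal sketch of the log-likelihood-ratio statistic for one contingency table. The function names are hypothetical, the cell names a, b, c, d follow the table above, and the sketch assumes that both rows of the table are non-empty; it omits the Good-Turing smoothing. The statistic is computed as twice the log-likelihood gained by allowing separate parameters $p_1$ and $p_2$ instead of a single pooled p, so it is never negative.

```python
import math


def log_binom_kernel(k, n, p):
    """Return log(p^k * (1-p)^(n-k)).

    The binomial coefficient is left out because it cancels in the ratio.
    Terms with a zero exponent are skipped, so p = 0 or p = 1 is safe
    whenever the corresponding count is zero.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def g2(a, b, c, d):
    """Log-likelihood-ratio statistic for the 2x2 table [[a, b], [c, d]].

    Assumes a + b > 0 and c + d > 0 (an assumption of this sketch).
    """
    p1 = a / (a + b)                  # maximum likelihood estimate, row v
    p2 = c / (c + d)                  # maximum likelihood estimate, row not-v
    p = (a + c) / (a + b + c + d)     # pooled estimate
    log_pooled = log_binom_kernel(a, a + b, p) + log_binom_kernel(c, c + d, p)
    log_separate = (log_binom_kernel(a, a + b, p1)
                    + log_binom_kernel(c, c + d, p2))
    return -2.0 * (log_pooled - log_separate)
```

A table whose rows have identical proportions scores zero, and the score grows as the two rows diverge, which is the behavior wanted from an initial estimate of like(u, v).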



5.1.2 Step 2: Estimation of Link Counts To further reduce the complexity of
estimating the model parameters, I employ the competitive linking algorithm,
which is a greedy approximation to the Viterbi approximation:

1. Sort all the translation likelihood estimates like(u, v) from highest to
   lowest.

2. For each likelihood estimate like(u, v), in order:

   (a) If u (resp., v) is NULL, consider all tokens of v (resp., u) in the
       bitext linked to NULL. Otherwise, link all co-occurring token
       pairs (u, v) in the bitext.

   (b) The one-to-one assumption implies that linked words cannot be
       linked again. Therefore, remove all linked word tokens from
       their respective halves of the bitext.

The competitive linking algorithm can be viewed as a heuristic search for the most
likely assignment in the space of all possible assignments. The search heuristic
is that the most likely assignments contain links that are individually the most
likely. The search proceeds by a process of elimination. In the first search
iteration, all the assignments that do not contain the most likely link are
discarded. In the second iteration, all the assignments that do not contain the
second most likely link are discarded, and so on until only one assignment
remains.⁵ The algorithm greedily selects the most likely links first, and then
selects less likely links only if they don't conflict with previous selections.
The probability of a link being rejected increases with the number of links that
are selected before it, and thus decreases with the link's likelihood. In this
problem domain, the competitive linking algorithm usually finds one of the most
likely assignments, as I will show in Section 0. Under an appropriate hashing
scheme, the expected running time of the competitive linking algorithm is linear
in the size of the input bitext.
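The greedy procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: co-occurrence is taken to mean "within the same aligned segment pair", scores are supplied as a dictionary keyed by word-type pairs, NULL is represented by Python's None, and tokens with no scored candidate are simply left unlinked. Because linked tokens are removed only from their own segment, processing each segment with the globally sorted scores gives the same links as one pass over the whole bitext. All function names are hypothetical.

```python
from collections import Counter

NULL = None  # stand-in for the NULL word (an assumption of this sketch)


def competitive_linking(src, tgt, like):
    """Greedily link the tokens of one aligned segment pair.

    src, tgt: lists of word tokens; like: dict mapping (u, v) to a score.
    Returns (i, j) index pairs; i or j is None for a link to NULL.
    """
    candidates = []
    for i, u in enumerate(src):
        for j, v in enumerate(tgt):
            if (u, v) in like:
                candidates.append((like[(u, v)], i, j))
        if (u, NULL) in like:
            candidates.append((like[(u, NULL)], i, None))
    for j, v in enumerate(tgt):
        if (NULL, v) in like:
            candidates.append((like[(NULL, v)], None, j))

    # Highest score first; positions break ties so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0],
                                   -1 if c[1] is None else c[1],
                                   -1 if c[2] is None else c[2]))

    linked_src, linked_tgt, links = set(), set(), []
    for _, i, j in candidates:
        if i in linked_src or j in linked_tgt:
            continue  # one-to-one assumption: a linked token is removed
        links.append((i, j))
        if i is not None:
            linked_src.add(i)
        if j is not None:
            linked_tgt.add(j)
    return links


def link_counts(bitext, like):
    """Count links per word-type pair over a list of (src, tgt) segments."""
    counts = Counter()
    for src, tgt in bitext:
        for i, j in competitive_linking(src, tgt, like):
            u = src[i] if i is not None else NULL
            v = tgt[j] if j is not None else NULL
            counts[(u, v)] += 1
    return counts


# Three scored pairs: the second-best pair loses because u1 is already linked.
like = {("u1", "v1"): 0.05, ("u1", "v2"): 0.02, ("u2", "v2"): 0.01}
links = competitive_linking(["u1", "u2"], ["v1", "v2"], like)
```

In the toy call, the pair (u1, v2) outscores (u2, v2) but is rejected because u1 was consumed by the stronger link to v1, which leaves v2 free to link to u2.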



5.1.3 Step 3: Re-estimation of the Model Parameters Method A re-estimates the
model parameters simply by normalizing the link counts to sum to 1, as in
Equation 16. The competitive linking algorithm only cares about the relative
magnitudes of the various like(u, v). However, Equation 19 is a sum rather than a
product, so I scale the parameters logarithmically, to be consistent with its
probabilistic interpretation:

\begin{equation}
like(u, v) = \log trans(u, v) \tag{25}
\end{equation}
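As a small illustration of this step, the sketch below normalizes a table of link counts into a joint distribution and takes logarithms. It assumes that "normalizing the link counts to sum to 1" means dividing each count by the grand total; the function name and the toy counts are hypothetical.

```python
import math


def reestimate_method_a(links):
    """Map link counts to like(u, v) = log trans(u, v).

    links: dict mapping (u, v) to a link count. Pairs with zero links get no
    entry, since the logarithm of zero is undefined.
    """
    total = sum(links.values())
    return {pair: math.log(n / total) for pair, n in links.items() if n > 0}


like = reestimate_method_a({("a", "x"): 3, ("b", "y"): 1})
```

Only the ordering of the scores matters to the competitive linking algorithm, and the logarithm preserves that ordering.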

5.2 Method B: Improved Estimation Using an Explicit Error Model 

Yarowsky (1993) has shown that "for several definitions of sense and collocation, 
an ambiguous word has only one sense in a given collocation with a probability of 
90-99%." In other words, a single contextual clue can be a highly reliable indicator 
of a word's sense. One of the definitions of "sense" studied by Yarowsky was a 
word token's translation in the other half of a bitext. For example, the English 
word sentence may be considered to have two senses, corresponding to its French 
translations peine (judicial sentence) and phrase (grammatical sentence). If a 
token of sentence occurs in the vicinity of a word like jury or prison, then it is far 
more likely to be translated as peine than as phrase. "In the vicinity of" is one 
kind of collocation. Co-occurrence in bitext space is another kind of collocation. 
If each word's translation is treated as a sense tag (Resnik & Yarowsky, 1997),
then "translational" collocations have the unique property that the collocate and 
the word sense are one and the same! 

Method B exploits this property under the hypothesis that "one sense per
collocation" holds for translational collocations. This hypothesis implies that if
u and v are possible mutual translations, and a token u co-occurs with a token v
in the bitext, then with very high probability the pair (u, v) was generated from
the same concept and should be linked. To test this hypothesis, I ran one
iteration of Method A on 300000 aligned sentence pairs from the Canadian Hansards
bitext. I then plotted the ratio links(u, v)/cooc(u, v) for several values of
cooc(u, v) in Figure 2. The bimodality of the surface shows that the ratio
links(u, v)/cooc(u, v) tends to be either very high or very low. Note that the
frequencies are plotted on a log scale; the bimodality is quite sharp.

Figure 2

A fragment of the joint frequency of (links(u, v)/cooc(u, v), cooc(u, v)). Note
that the frequencies are plotted on a log scale; the bimodality is quite sharp.

5 Given a method of assigning probabilities to assignments, the competitive
linking algorithm can be generalized to stop searching before the number of
possible assignments is reduced to one, at which point the link counts can be
computed as weighted averages over the remaining assignments using Equation 18.

Information about how often words co-occur without being linked can be used to
bias the estimation of translation model parameters. The smaller the ratio
links(u, v)/cooc(u, v), the more likely it is that u and v are not mutual
translations, and that links posited between tokens of u and v are noise. The bias
can be implemented via auxiliary parameters that model the curve illustrated in
Figure 2. The competitive linking algorithm creates all the links of a given type
independently of each other.⁶ So, the distribution of the number links(u, v) of
links connecting word types u and v can be modeled by a binomial distribution
with parameters cooc(u, v) and p(u, v), where p(u, v) is the probability that u
and v will be linked when they co-occur. There is never enough data to robustly
estimate each p parameter separately. Instead, I shall model the p's via only two
distinct parameters. If u and v are mutual translations, then p(u, v) will average
to a relatively high probability, which I will call $\lambda^+$. If u and v are not
mutual translations, then p(u, v) will average to a relatively low probability,
which I will call $\lambda^-$. $\lambda^+$ and $\lambda^-$ correspond to the two peaks
of the distribution of links(u, v)/cooc(u, v), a fragment of which is illustrated
in Figure 2. The two parameters can also be interpreted as the rates of true and
false positives. If the translation in the bitext is consistent and the
translation model is accurate, then $\lambda^+$ will be close to 1 and $\lambda^-$
will be close to 0.

6 Except for the case when multiple tokens of the same word type occur near each
other, which I hereby sweep under the carpet.
To find the most likely values of the auxiliary parameters $\lambda^+$ and
$\lambda^-$, I adopt the standard method of maximum likelihood estimation, and find
the values that maximize the probability of the link frequency distributions,
under the usual independence assumptions, where

\begin{equation}
\Pr(links \mid model) = \prod_{u,v} \Pr(links(u,v) \mid cooc(u,v), \lambda^+, \lambda^-). \tag{26}
\end{equation}

The factors on the right-hand side of Equation 26 can be written explicitly with
the help of a mixture coefficient. Let $\tau$ be the probability that an arbitrary
co-occurring pair of word types are mutual translations. Let $B(k \mid n, p)$ denote
the probability that k links are observed out of n co-occurrences, where k has a
binomial distribution with parameters n and p. Then the probability that two
arbitrary word types u and v are linked links(u, v) times out of cooc(u, v)
co-occurrences is a mixture of two binomials:

\begin{align}
\Pr(links(u,v) \mid cooc(u,v), \lambda^+, \lambda^-) = {} & \tau B(links(u,v) \mid cooc(u,v), \lambda^+) \tag{27} \\
& + (1 - \tau) B(links(u,v) \mid cooc(u,v), \lambda^-) \notag
\end{align}

One more variable allows us to express $\tau$ in terms of $\lambda^+$ and
$\lambda^-$: Let $\lambda$ be the probability that an arbitrary co-occurring pair of
word tokens will be linked, regardless of whether they are mutual translations.
Since $\tau$ is constant over all word types, it also represents the probability
that an arbitrary co-occurring pair of word tokens are mutual translations.
Therefore,

\begin{equation}
\lambda = \tau \lambda^+ + (1 - \tau) \lambda^-. \tag{28}
\end{equation}

$\lambda$ can also be estimated empirically. Let K be the total number of links in
the bitext and let N be the total number of co-occurring word token pairs:

\begin{align}
K &= \sum_{u,v} links(u,v), \tag{29} \\
N &= \sum_{u,v} cooc(u,v). \tag{30}
\end{align}

By definition,

\begin{equation}
\lambda = K/N. \tag{31}
\end{equation}

Equating the right-hand sides of Equations 28 and 31 and rearranging the terms,
we get:

\begin{equation}
\tau = \frac{K/N - \lambda^-}{\lambda^+ - \lambda^-}. \tag{32}
\end{equation}

Since $\tau$ is now a function of $\lambda^+$ and $\lambda^-$, only the latter two
variables represent degrees of freedom in the model.



The probability function expressed by Equations 26 and 27 may have many local
maxima. In practice, these local maxima are like pebbles on a mountain, invisible
at low resolution. I computed Equation 26 over various combinations of
$\lambda^+$ and $\lambda^-$ after one iteration over 300000 aligned sentence pairs
from the Canadian Hansard bitext. Figure 3 illustrates that the region of interest
in the parameter space, where $1 > \lambda^+ > \lambda > \lambda^- > 0$, has only
one dominant global maximum. This global maximum can be found by standard
hill-climbing methods, as long as the step size is large enough to avoid getting
stuck on the pebbles.
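The estimation of the two auxiliary parameters can be sketched as below. Two things are assumptions of the sketch rather than the method described above: a coarse grid search over the region of interest stands in for hill-climbing, and the counts are made-up toy values. The code also assumes that the empirical link rate K/N lies strictly between 0 and 1, so that every grid point satisfies $1 > \lambda^+ > \lambda > \lambda^- > 0$. All names are hypothetical.

```python
import math


def log_binom(k, n, p):
    """log B(k | n, p) for 0 < p < 1; lgamma gives the log binomial coefficient."""
    log_coeff = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1))
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)


def mixture_weight(links, cooc, lam_plus, lam_minus):
    """tau = (K/N - lambda^-) / (lambda^+ - lambda^-)."""
    K = sum(links.values())   # total number of links
    N = sum(cooc.values())    # total number of co-occurring token pairs
    return (K / N - lam_minus) / (lam_plus - lam_minus)


def log_likelihood(links, cooc, lam_plus, lam_minus):
    """Log of the product over word-type pairs of the two-binomial mixture."""
    tau = mixture_weight(links, cooc, lam_plus, lam_minus)
    total = 0.0
    for pair, n in cooc.items():
        k = links.get(pair, 0)
        a = math.log(tau) + log_binom(k, n, lam_plus)
        b = math.log(1.0 - tau) + log_binom(k, n, lam_minus)
        m = max(a, b)  # log-sum-exp, to avoid underflow on large counts
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def grid_search(links, cooc, steps=20):
    """Coarse search over lambda^- in (0, K/N) and lambda^+ in (K/N, 1)."""
    lam = sum(links.values()) / sum(cooc.values())
    best = None
    for i in range(1, steps):
        lam_minus = lam * i / steps
        for j in range(1, steps):
            lam_plus = lam + (1.0 - lam) * j / steps
            ll = log_likelihood(links, cooc, lam_plus, lam_minus)
            if best is None or ll > best[0]:
                best = (ll, lam_plus, lam_minus)
    return best


# Toy counts, invented for illustration: two pairs are linked almost always,
# two almost never.
cooc = {("a", "x"): 10, ("a", "y"): 10, ("b", "x"): 10, ("b", "y"): 10}
links = {("a", "x"): 9, ("a", "y"): 1, ("b", "y"): 10}
best_ll, best_plus, best_minus = grid_search(links, cooc)
```

A coarse grid has the property the text asks of the step size: it is too blunt to get stuck on small local maxima. A finer search could then be run around the best grid cell.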

Given estimates for $\lambda^+$ and $\lambda^-$, we can compute
$B(links(u,v) \mid cooc(u,v), \lambda^+)$ and $B(links(u,v) \mid cooc(u,v), \lambda^-)$
for each occurring combination of links and cooc values. These are the
probabilities that links(u, v) links were generated out of cooc(u, v) possible
links by a process that generates correct links and by a process that generates
incorrect links, respectively. The ratio of these probabilities is the likelihood
ratio in favor of the types u and v being possible mutual translations, for all u
and v:

\begin{equation}
like(u,v) = \log \frac{B(links(u,v) \mid cooc(u,v), \lambda^+)}{B(links(u,v) \mid cooc(u,v), \lambda^-)} \tag{33}
\end{equation}
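The log-likelihood ratio is cheap to evaluate, because the binomial coefficients in the numerator and denominator are identical and cancel. A minimal sketch, with a hypothetical function name and illustrative parameter values that are not estimates from any bitext:

```python
import math


def like_score(links_uv, cooc_uv, lam_plus, lam_minus):
    """Log ratio of B(links | cooc, lambda^+) to B(links | cooc, lambda^-).

    Assumes 0 < lam_minus < lam_plus < 1. The binomial coefficients cancel,
    leaving only the power terms.
    """
    k, n = links_uv, cooc_uv
    return (k * math.log(lam_plus / lam_minus)
            + (n - k) * math.log((1.0 - lam_plus) / (1.0 - lam_minus)))


# Illustrative values only.
always_linked = like_score(3, 3, 0.9, 0.1)   # linked at every co-occurrence
never_linked = like_score(0, 3, 0.9, 0.1)    # co-occurs but is never linked
```

Pairs that are linked every time they co-occur get positive scores that grow with the number of co-occurrences, while pairs that co-occur without being linked get negative scores.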

In the preceding equations, either u or v can be NULL. However, the number of
times that a word co-occurs with NULL is not an observable feature of bitexts. To
make sense of co-occurrences with NULL, we can view co-occurrences as potential
links and cooc(u, v) as the maximum number of times that tokens of u and v
might be linked. From this point of view, cooc(u, NULL) should be set to the
marginal frequency of u, since each token of u represents one potential link to
NULL. These co-occurrence counts should be summed together with all the others
in Equation 30.

Method B differs from Method A only in its use of the auxiliary parameters in
Equation 32 to re-estimate the model parameters. These parameters and the error
model that they represent can be employed the same way in translation models
that are not based on the one-to-one assumption. An interesting property of
Equation 33 is that it is possible, for a given word type u, that like(u, v) < 0
for all v including NULL. These are the words about which the model is uncertain,
and they represent fertile ground for future work.

5.3 Method C: Improved Estimation Using Pre-Existing Word Classes 

In Method B, the estimation of the auxiliary parameters $\lambda^+$ and
$\lambda^-$ depends only on the co-occurrence counts and on the distributions of
link frequencies generated by the competitive linking algorithm. All word pairs
that co-occur the same number of times and are linked the same number of times
are assigned the same like value. More accurate models can be induced by taking
into account various features of the linked tokens. For example, frequent words
are translated less consistently than rare words (Melamed, 1997a). To account for
these differences, we can estimate separate values of $\lambda^+$ and $\lambda^-$
for different ranges of cooc(u, v). Similarly, the auxiliary parameters can be
conditioned on the linked parts of speech. A kind of word order correlation bias
can be effected by conditioning the auxiliary parameters on the relative positions
of linked word tokens in their respective texts. Just as easily, we can model link
types that coincide with entries in an on-line bilingual dictionary separately
from those that do not (cf. Brown et al., 1994). When the auxiliary parameters are
conditioned on different link classes, Step 3 of Method B is repeated for each
link class.



6 Effects of Sparse Data 



The one-to-one assumption is a potent weapon against the ever-present sparse
data problem. The assumption enables accurate estimation of translational
distributions even for words that occur only once, as long as the surrounding
words are more frequent. In most translation models, link likelihood is correlated
with co-occurrence frequency. So, links between tokens u and v for which
like(u, v) is highest are the ones for which there is the most evidence, and thus
also the ones that are easiest to predict correctly. Winner-take-all link
assignment methods, such as the competitive linking algorithm, can prevent links
based on indirect associations (see Section 4.1), thereby leveraging their
accuracy on the more confident links to raise the accuracy of the less confident
links. For example, suppose that $u_1$ and $u_2$ co-occur with $v_1$ and $v_2$ in the
training data, and the model estimates like($u_1$, $v_1$) = .05,
like($u_1$, $v_2$) = .02, and like($u_2$, $v_2$) = .01. According to the one-to-one
assumption, ($u_1$, $v_2$) is an indirect association and the correct translation of
$v_2$ is $u_2$. To the extent that the one-to-one assumption is valid, it reduces
the probability of spurious links for the rarer words. The more the incorrect
candidate translations can be eliminated for a given rare word, the more likely
the correct translation is to be found. So, the probability of a correct match for
a rare word is proportional to the fraction of words around it that can be linked
with higher confidence. This fraction is largely determined by two bitext
properties: the distribution of word frequencies, and the distribution of
co-occurrence counts. I shall explore each of these properties in turn.

The distribution of word frequencies is a function of corpus size. The words
in any text corpus are drawn from a large but finite vocabulary. As the corpus
gets larger, fewer new words appear, and the average frequency of words already
in the corpus rises. I took random samples of varying sizes from large text
corpora in French and in English. The corpora comprised news text (Le Monde and
Wall Street Journal), parliamentary debate transcripts (Hansards) and Sun
MicroSystems software documentation (AnswerBooks). Figure 4 shows the log-log
relationship between sample size and the fraction of words (by token) that
appear in the sample only once. For example, suppose we draw a random sample
of one million words from Le Monde, and then select a random word type w
from this random sample. According to Figure 4, the chances are roughly 0.017
that w appears only once in that one million words. If the sample were only one
thousand words, however, our chances of drawing a singleton rise to 0.317. The
nearly linear shape of the log-log curve seems largely invariant across languages
and text genres, as predicted by Zipf (1936). Some curves in the graph are higher
than others, because the language genres from which the corpora were drawn
have richer vocabularies. For example, the fraction of singleton words is
consistently smaller in the stemmed English Hansards than in the same text when it
is not stemmed, which is the whole motivation for stemming. Figure 5, based on
Le Monde text, shows that the log-log relationship holds for higher frequencies
too. In a larger corpus, a larger fraction of the word types appear more frequently.

Figure 4

The log-log relationship between corpus size and the proportion of singletons.
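The quantity plotted against sample size can be computed directly from a token list. The sketch below assumes the "by token" reading, in which the number of tokens whose word type occurs exactly once is divided by the total number of tokens in the sample; the function name and the toy sample are hypothetical.

```python
from collections import Counter


def singleton_fraction(tokens):
    """Fraction of tokens in the sample whose word type occurs exactly once."""
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(tokens)


# Toy sample: "b" and "c" occur once, "a" occurs twice.
fraction = singleton_fraction(["a", "b", "a", "c"])
```

Repeating this over random samples of increasing size from one corpus would give one curve of the kind described above.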



Figure 5

The log-log relationship for higher frequencies. The bottom curve in this graph is
the same as the top curve in Figure 4.

Corpus size determines the probability that a randomly chosen word will
have a particular frequency. The likelihood of a correct link for a rare word token
w also depends on one other variable. If w co-occurs with only one rare word
(in the opposite half of the bitext), then the competitive linking algorithm is
likely to eliminate all of w's indirect associations before it attempts to link w.
Problems arise only when more than one candidate remains for linking to w.
What is the probability that w co-occurs with more than one rare word? The
analysis is easiest under the distance-based model of co-occurrence, where the
threshold $\delta$ on the distance from the bitext map is specified in words rather
than in characters (Melamed, 1998a). Suppose that w co-occurs with $\gamma$ words
in the opposite half of the bitext, where $\gamma$ is either the vertical or
horizontal component of $\delta$.⁷ Let p be the probability that a word
co-occurring with w is rare. Then the probability of exactly k rare words
co-occurring with w can be approximated by a binomial distribution with
parameters $\gamma$ and p. It follows that the probability of more than one rare
word co-occurring with w is

\begin{equation}
\Pr(> 1 \text{ rare word co-occurring}) = 1 - B(0 \mid \gamma, p) - B(1 \mid \gamma, p). \tag{34}
\end{equation}

Figure 6 plots Equation 34 over different values of $\gamma$ and p. The range of p
corresponds roughly to the range of the y-axis in Figures 4 and 5. The figure
illustrates how the power of the one-to-one assumption varies with corpus size.
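Equation 34 is simple enough to evaluate directly. The sketch below assumes that the window width is an integer number of words; the function names are hypothetical, and the parameter values in the usage lines are illustrative choices, not readings from the figures.

```python
from math import comb


def binom_pmf(k, n, p):
    """B(k | n, p): probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def prob_more_than_one_rare(gamma, p):
    """Probability that more than one of gamma co-occurring words is rare."""
    return 1.0 - binom_pmf(0, gamma, p) - binom_pmf(1, gamma, p)


# Illustrative values: the same window width with a low and a high rare-word rate.
low = prob_more_than_one_rare(10, 0.02)
high = prob_more_than_one_rare(10, 0.30)
```

For a fixed window width, the probability of competing rare candidates rises quickly with p, which is why the one-to-one assumption helps rare words most when the surrounding words are frequent.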



7 I.e. $\gamma$ is the same as Dagan et al. (1993)'s window width.
Figure 6

Probability of more than one rare word co-occurring. 



7 Evaluation 



7.1 Evaluation By Token 

This section compares translation model estimation methods A, B and C to each
other and to Brown et al. (1993)'s Model 1. Until now, translation models have
been evaluated either subjectively (e.g. White & O'Connell, 1993) or using
relative metrics, such as perplexity with respect to other models (Brown et al.,
1993). More objective and more accurate tests can be carried out using a "gold
standard." I hired bilingual annotators to link roughly sixteen thousand
corresponding words between on-line versions of the Bible in French and English.
This bitext was selected to facilitate widespread use and standardization (see
Melamed, 1998d, for details). The entire Bible bitext comprised 29614 verse pairs,
of which 250 verse pairs were hand-linked using a specially developed annotation
tool. The annotation style guide (Melamed, 1998c) was based on the intuitions of
the annotators, so it was not biased towards any particular translation model. The
annotation was carried out 5 times by different annotators.

A straightforward metric for evaluating a translation model with respect to a
gold standard can be derived from the recall and precision measures widely used
in the information retrieval literature. When comparing a set of "test" elements
X to a set of "correct" elements Y,

\begin{align}
precision(X \mid Y) &= \frac{|X \cap Y|}{|X|}, \tag{35} \\
recall(X \mid Y) &= \frac{|X \cap Y|}{|Y|}. \tag{36}
\end{align}

X and Y can be fuzzy sets, such as probability distributions, in which case |X|
is defined as the sum of the weights of the elements in X and $|X \cap Y|$ is the
sum of the weights of the elements shared by X and Y.



Equations 35 and 36 differ only in the set whose size is used as the
denominator. If neither X nor Y is privileged, or if precision and recall are
equally important, we can compute a symmetric measure of agreement D as the
harmonic mean of precision and recall:

\begin{equation}
D(X, Y) = \frac{2}{\frac{1}{precision(X \mid Y)} + \frac{1}{recall(X \mid Y)}} = \frac{2 \cdot |X \cap Y|}{|X| + |Y|} \tag{37}
\end{equation}

D is the set-theoretic equivalent of the Dice coefficient (Dice, 1945) and
conveniently ranges from zero to one.
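The three measures can be sketched for weighted sets as follows. A set is represented as a dictionary from elements to weights, with ordinary sets as the special case where every weight is 1. One point is an assumption of this sketch: the weight of a shared element is taken to be the smaller of its two weights, which is the usual fuzzy-set intersection but is not spelled out above. The function names are hypothetical.

```python
def size(x):
    """|X|: sum of the weights of the elements of X."""
    return sum(x.values())


def intersection_size(x, y):
    """|X n Y|, taking min(weight in X, weight in Y) for each shared element
    (an assumption of this sketch)."""
    return sum(min(w, y[e]) for e, w in x.items() if e in y)


def precision(x, y):
    return intersection_size(x, y) / size(x)


def recall(x, y):
    return intersection_size(x, y) / size(y)


def dice(x, y):
    """Harmonic mean of precision and recall."""
    return 2.0 * intersection_size(x, y) / (size(x) + size(y))


# Toy link sets: the test set proposes three links, two of which are correct.
test_links = {("a", "x"): 1.0, ("b", "y"): 1.0, ("c", "z"): 1.0}
gold_links = {("a", "x"): 1.0, ("b", "y"): 1.0}
d = dice(test_links, gold_links)
```

On the toy sets, precision is 2/3 and recall is 1, and D equals their harmonic mean of 0.8.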

To reiterate, Model 1 is based on co-occurrence information only; Method A
is based on the one-to-one assumption; Method B adds the "one sense per
collocation" hypothesis to Method A; Method C conditions the auxiliary parameters
of Method B on various word classes. Whereas Methods A and B and Model 1
were fully specified in Section 4.2.1 and Section 5, the latter section described a
variety of features on which Method C might classify words. For the purposes of
the experiments reported in this article, Method C employed the simple
classification in Table 2 for both languages in the bitext. All classification was
performed by table lookup; no context-aware part-of-speech tagger was used. In
particular, words that were ambiguous between open classes and closed classes
were always deemed to be in the closed class. The only language-specific
knowledge involved in this classification method is the list of function words in
class F. Certainly, more sophisticated word classification methods could produce
better models, but even the simple classification in Table 2 should suffice to
demonstrate the method's potential.

Class Code   Description
EOS          End-Of-Sentence punctuation
EOP          End-Of-Phrase punctuation, such as commas and colons
SCM          Subordinate Clause Markers, such as " and (
SYM          Symbols, such as ~ and *
NU           the NULL word, in a class by itself
C            Content words: nouns, adjectives, adverbs, non-auxiliary verbs
F            all other words, i.e. function words

Table 2

Word classes used by Method C for the experiments reported in this article.

Each of the four methods was used to estimate a word-to-word translation
model from the 29614 verse pairs in the Bible bitext. All methods were deemed to
have converged when less than .0001 of the translational probability distribution
changed from one iteration to the next. The links assigned by each of methods A,
B and C in the last iteration were normalized into joint probability distributions
using Equation 16. I shall refer to these joint distributions as Model A, Model B
and Model C, respectively. Each of the joint probability distributions was further
normalized into two conditional probability distributions, one in each direction.
Since Model 1 is inherently directional, its conditional probability distributions
were estimated separately in each direction, instead of being derived from a joint
distribution.
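Deriving the two directional models from a joint distribution amounts to dividing each joint probability by the appropriate marginal. A minimal sketch, with a hypothetical function name and a made-up joint distribution:

```python
from collections import defaultdict


def conditionals(joint):
    """Split a joint distribution over (u, v) pairs into two conditionals.

    Returns (v_given_u, u_given_v), each keyed by the same (u, v) pairs.
    """
    marginal_u = defaultdict(float)
    marginal_v = defaultdict(float)
    for (u, v), p in joint.items():
        marginal_u[u] += p
        marginal_v[v] += p
    v_given_u = {(u, v): p / marginal_u[u] for (u, v), p in joint.items()}
    u_given_v = {(u, v): p / marginal_v[v] for (u, v), p in joint.items()}
    return v_given_u, u_given_v


# Toy joint distribution, invented for illustration.
joint = {("a", "x"): 0.5, ("a", "y"): 0.25, ("b", "y"): 0.25}
v_given_u, u_given_v = conditionals(joint)
```

The single-best translation of a word in either direction is then the entry with the highest conditional probability for that word.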

The four models' predictions were compared to the gold standard annotations. 
Although the models were evaluated on part of the same bitext on which they 
were trained, the evaluations were with respect to the translational equivalence 
relation hidden in this bitext, not with respect to any of the bitext's visible
features. Such testing on training data is acceptable for unsupervised learning 
algorithms. 

Before comparing the accuracies of the different models, it is interesting to
consider their convergence rates. Figure 7 shows that, although the EM algorithm
guarantees monotonic convergence for Model 1, it requires more iterations to
converge on these training data than models A, B and C. To be fair, we must
remember that Method B and Method C take time to estimate their auxiliary
parameters on each iteration. So, Figure 7 does not say which method is fastest
in real time. Such a comparison is very dependent on the details of each method's
implementation. In the current (very inefficient) implementations, Model A
converged in about 6 hours, Model B in about 20 hours, Model C in about 24 hours
and Model 1 in about 27 hours.

Figure 7

Convergence rates for Model 1 and Methods A, B, and C. Changes from each iteration
to the next were measured in terms of the set-theoretic Dice coefficient.

The first evaluation was on "single-best" translation of the kind that somebody
might use to get the gist of a foreign-language document. The input to the
experiment was one side of the gold standard bitext. The output was the model's
single best guess about the translation of each word in the input, together with
the input word. In other words, each model produced link tokens consisting of
input words and their translations. I computed the models' precision and recall
by comparing the link tokens produced by each model to the link tokens in the
gold standard. The accuracy of each model was averaged over the two directions
of translation: English to French and French to English. Figure 8(a) shows that
each of the innovations introduced in Section 5 improves both precision and
recall on this task on these data. The gold standard bitext was actually annotated
five times by seven different annotators. This replication helped to establish
statistical significance among the differences in model accuracy. The performance
differences reported in this section are statistically significant at the
$\alpha = .05$ level, according to the Wilcoxon signed ranks test.

Some applications don't care about function words. To get a sense of the 
relative effectiveness of the different translation model estimation methods when 
function words are taken out of the equation, I removed all closed-class words 
(including non-alphabetic symbols) from the models and renormalized the condi- 
tional probabilities. Then, I removed from the gold standard all link tokens where 
one or both of the linked words were closed-class words. Finally, I recomputed 
precision and recall. The results are shown in Figure 8(b). When closed-class
words were ignored, Model 1 performed better than Method A, because open-class
words are more likely to violate the one-to-one assumption. However, the
explicit error model in Methods B and C boosted their recall and precision signif- 
icantly higher than Model 1 and Method A. As expected, there was no significant 
difference in accuracy between Method B and Method C on this task, because it 
left only two classes for Method C to distinguish: content words and nulls. 

For some applications, it is insufficient to guess only the most likely
translation of each word in the input. The model is expected to output the entire
distribution of possible translations for each input word. This distribution is
then convolved with other distributions that are relevant to the application. For
example, in cross-language information retrieval, the translational distribution is
convolved with the distribution of term frequencies. In statistical machine
translation, the translational distribution can be convolved with a target language
model (Brown et al., 1988). To see how the different models might perform on
this "whole distribution" task, I performed a second set of experiments. This
time, the models generated a whole set of links from each input word, weighted
according to the probability assigned to each of the input word's translations. I
computed the precision and recall of the fuzzy sets of links generated by the
models to the five gold standard annotations as before. I repeated the experiment
once with closed-class words and once without, and again averaged the results over
the two directions of translation. The results are in Figure 9, which is plotted on
the same scale as Figure 8 to facilitate comparison. The only change in the
relative accuracy of the models was that Methods B and C no longer had
significantly higher precision than Model 1 when closed-class words were ignored.
However, all the scores were lower than their counterparts on the "single-best"
translation task, because it is more difficult for any statistical method to
correctly model the less common translations. The "best" translations are usually
the most common.
Figure 8

Comparison of model performance on "single-best" translation task. (a) All links;
(b) open-class links only.

Figure 9

Comparison of model performance on "whole distribution" task. (a) All links;
(b) open-class links only.
To study how the benefits of the various biases vary with training corpus
size, I evaluated Models A, B, C and 1 on the "whole distribution" translation
task, after training them on three different-size subsets of the Bible bitext. The
first subset consisted of only the 250 verse pairs in the gold standard. The second
subset included these 250 plus another random sample of 2250 for a total of 2500,
an order of magnitude larger than the first subset. The third subset contained all
29614 verse pairs in the Bible bitext, roughly an order of magnitude larger than
the second subset. All models were compared to the five gold standard
annotations. The correlation between recall and precision was very high on this
task ($\rho = .99$), as illustrated in Figure 9(a). So, the results can be well
represented by the set-theoretic Dice coefficient in Equation 37, as applied to
probabilistic (fuzzy) sets. The mean Dice scores over the five gold standard
annotations are graphed in Figure 10. The figure suggests that, at least for
French/English translation models, each of the biases presented in this article
improves the efficiency of modeling the available training data. The one-to-one
assumption is useful, even though it is tractable only under a greedy estimation
method. In relative terms, the advantage of the one-to-one assumption is much more
pronounced on smaller training sets. For example, Model A is 29% more accurate
than Model 1 when trained on only 250 verse pairs. The explicit error model buys a
considerable gain in accuracy across all sizes of training data, as do the link
classes of Model C. In concert, on the gold standard test set, the three biases
outperformed Model 1 by up to 55%. This difference is even more significant given
the absolute performance ceiling of 82% established by the inter-annotator
agreement rates on the gold standard (Melamed, 1998d).

Figure 10

Effects of training set size on model accuracy on the "whole distribution" task.
7.2 Evaluation By Type

An important application of statistical translation models is to help lexicogra- 
phers compile bilingual dictionaries. Dictionaries are written to answer the ques- 
tion, "What are the possible translations of X?" This is a question about link 
types, rather than about link tokens. 

Evaluation by link type is a thorny issue. Human judges often disagree about
the degree to which context should play a role in judgements of translational
equivalence. For example, the Harper-Collins French Dictionary (Cousin et al.,
1990) gives the following French translations for English appoint: nommer,
engager, fixer, désigner. Likewise, most lay judges would not consider instituer a
correct French translation of appoint. In actual translations, however, when the
object of the verb is commission, task force, panel, etc., English appoint is usually
translated into French as instituer. To account for this kind of context-dependent
translational equivalence, link types must be evaluated with respect to the bitext
whence they were induced.

I performed a post-hoc evaluation of the link types produced by an earlier
version of Method B. The bitext used for this evaluation was the same aligned
Hansards bitext used by Gale & Church (1991), except that I used only 300,000
aligned segment pairs to save time. The bitext was automatically pre-tokenized
to delimit punctuation, English possessive pronouns and French elisions.
Morphological variants in both halves of the bitext were stemmed to a canonical
form.

The link types assigned by the converged model were sorted by the
log-likelihood scores in Equation 33. Figure 11 shows the distribution of these
scores on a log scale. The log scale helps to illustrate the plateaus in the
curve. The longest plateau represents the set of word pairs that were linked
once out of one co-occurrence (1/1) in the bitext. All these word pairs were
equally likely to be correct. The second-longest plateau resulted from word
pairs that were linked twice out of two co-occurrences (2/2) and the
third-longest plateau is from word pairs that were linked three times out of
three co-occurrences (3/3). As usual, the entries with higher likelihood scores
were more likely to be correct. By discarding entries with lower likelihood
scores, recall could be traded off for precision. This trade-off was measured at
three points, representing cutoffs at the end of each of the three longest
plateaus.
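
The thresholding step can be sketched in Python as follows. This is only a
minimal illustration: the entry format (source word, target word, score), the
function names, and the toy scores are hypothetical, and a plateau is taken to
be a run of entries with exactly equal scores.

```python
from itertools import groupby


def longest_plateaus(entries, k=3):
    """Return the scores of the k longest runs of equal score, longest first.

    `entries` is a list of (source_word, target_word, score) tuples.
    """
    ranked = sorted(entries, key=lambda e: e[2], reverse=True)
    runs = [(score, len(list(group)))
            for score, group in groupby(ranked, key=lambda e: e[2])]
    runs.sort(key=lambda run: run[1], reverse=True)
    return [score for score, _ in runs[:k]]


def cut_lexicon(entries, min_score):
    """Keep entries scoring at least `min_score`, trading recall for precision."""
    return [e for e in entries if e[2] >= min_score]


# Toy scores for illustration only.
entries = [("a", "x", 9.1), ("b", "y", 5.0), ("c", "z", 5.0),
           ("d", "w", 2.0), ("e", "v", 2.0), ("f", "t", 2.0)]
cutoffs = longest_plateaus(entries, k=2)
smaller_lexicon = cut_lexicon(entries, 5.0)
```

Cutting at a plateau's score keeps that whole plateau together with every
higher-scoring entry, which corresponds to a cutoff at the end of the plateau.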

The traditional method of measuring recall requires knowledge of the correct
link types, which is impossible to determine without a gold standard. An
approximate recall measure can be based on the number of different words in
the corpus. For lexicons extracted from corpora, perfect recall implies at least
one entry containing each word in the corpus. One-sided variants, which consider
only source words, have also been used (Gale & Church, 1991). Table 3 reports
both the marginal (one-sided) and the combined recall at each of the three
cut-off points. It also reports the absolute number of (non-null) entries in each
of the three lexicons. Of course, the size of automatically induced lexicons
depends on the size of the training bitext. Table 3 shows that, given a
sufficiently large bitext, the method can automatically construct translation
lexicons with as many entries as published bilingual dictionaries.

Figure 11

Distribution of link type log-likelihood scores. The long plateaus correspond to
the most common combinations of links to co-occurrences: 1/1, 2/2 and 3/3.

Table 3

Lexicon recall at three different minimum likelihood thresholds. The bitext
contained 41,028 different English words and 36,314 different French words, for
a total of 77,342.
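
The approximate recall measure described above can be sketched in Python as
follows. The function and variable names are hypothetical, the toy lexicon is
not data from the experiments, and the combined figure is assumed to pool the
word types of both languages, which may differ from the exact computation
behind Table 3.

```python
def approximate_recall(lexicon, english_vocab, french_vocab):
    """Fraction of corpus word types that occur in at least one lexicon entry.

    `lexicon` is an iterable of (english_word, french_word) pairs.
    Returns (english_recall, french_recall, combined_recall).
    """
    covered_english = {u for u, _ in lexicon} & english_vocab
    covered_french = {v for _, v in lexicon} & french_vocab
    english_recall = len(covered_english) / len(english_vocab)
    french_recall = len(covered_french) / len(french_vocab)
    # Assumption: combined recall pools the word types of both languages.
    combined = (len(covered_english) + len(covered_french)) / (
        len(english_vocab) + len(french_vocab))
    return english_recall, french_recall, combined


# Toy data for illustration only.
lexicon = [("house", "maison"), ("home", "maison"), ("dog", "chien")]
english_vocab = {"house", "home", "dog", "cat"}
french_vocab = {"maison", "chien"}
recalls = approximate_recall(lexicon, english_vocab, french_vocab)
```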

The next task was to measure precision. It would have taken too long to
evaluate every lexicon entry manually. Instead, I took 5 random samples (with
replacement) of 100 entries each from each of the three lexicons. Each of the
samples was first compared to a translation lexicon extracted from a
machine-readable bilingual dictionary (Cousin et al., 1991). All the entries in
the sample that appeared in the dictionary were assumed to be correct. I checked
the remaining entries in all the samples by hand. To account for
context-dependent translational equivalence, I evaluated the precision of the
translation lexicons in the context of the bitext whence they were extracted,
using a simple bilingual concordancer. A lexicon entry (u,v) was considered
correct if u and v ever appeared as direct translations of each other in an
aligned segment pair.
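
The sampling procedure can be sketched in Python as follows. This is a minimal
illustration under stated assumptions: the function name and parameters are
hypothetical, entries within each sample are assumed to be drawn with
replacement, and the is_correct callback stands in for the dictionary lookup
followed by the manual check against the bitext.

```python
import random
import statistics


def sampled_precision(lexicon, is_correct, n_samples=5, sample_size=100, seed=0):
    """Estimate precision from several random samples drawn with replacement.

    Returns (mean, standard deviation) of the per-sample precision.
    """
    rng = random.Random(seed)
    precisions = []
    for _ in range(n_samples):
        sample = rng.choices(lexicon, k=sample_size)  # sampling with replacement
        correct = sum(1 for entry in sample if is_correct(entry))
        precisions.append(correct / sample_size)
    return statistics.mean(precisions), statistics.stdev(precisions)


# Toy data for illustration only; the set stands in for the judged-correct entries.
lexicon = [("house", "maison"), ("dog", "chien"), ("cat", "chien")]
dictionary = {("house", "maison"), ("dog", "chien")}
mean_precision, std_precision = sampled_precision(
    lexicon, lambda entry: entry in dictionary)
```

Fixing the random seed makes the samples reproducible, which is convenient when
some entries have to be judged by hand.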

Direct translations come in different flavors. Most entries that I checked by
hand were of the plain vanilla variety that you might find in a bilingual
dictionary (entry type V). However, a significant number of words translated
into a different part of speech (entry type P). For instance, in the entry
(protection, protégé), the English word is a noun but the French word is an
adjective. This entry appeared because "to have protection" is often translated
as "être protégé" in the bitext. The entry will never occur in a bilingual
dictionary, but users of translation lexicons, be they human or machine, will
want to know that translations often happen this way. Incomplete entries,
described above, were counted in a third category (entry type I).

Table 4 

Distribution of different types of correct lexicon entries at varying levels of recall 
(mean ± standard deviation). 

Table 4 reports the distribution of correct lexicon entries among the types V,
P and I. Figure 12 graphs the precision of the method against recall, with 95%
confidence intervals. The upper curve represents precision when incomplete links
are considered correct, and the lower when they are considered incorrect. On the
former metric, the method can generate translation lexicons with precision and
recall both exceeding 90%, as well as dictionary-sized translation lexicons that
are over 99% correct.

Figure 12

Translation lexicon precision with 95% confidence intervals at varying levels of recall. 

8 Conclusion

There are many ways to model translational equivalence and many ways to
estimate translation models. "The mathematics of statistical machine translation"
proposed by Brown et al. (1993) are just one kind of mathematics for one kind of
statistical translation. In this article, I have proposed and evaluated new kinds
of translation model biases, alternative parameter estimation strategies, and
general techniques for exploiting pre-existing knowledge that may be available
about particular languages and language pairs. On a variety of evaluation
metrics, each infusion of knowledge about the problem domain resulted in better
translation models.

Each innovation presented here opens the way for more research. Model biases
can be mixed and matched with each other, with previously published biases like
the word order correlation bias, and with other biases yet to be invented. The
competitive linking algorithm can be generalized in various ways. New kinds of
pre-existing knowledge can be exploited to effect significant accuracy
improvements for particular language pairs or even just for particular bitexts.
It is difficult to say where the greatest advances will come from. Yet, one
thing is clear from our current vantage point: Research on empirical methods for
modeling translational equivalence has not run out of steam, as some have
claimed, but has only just begun.